The outbreak of the novel coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 112 million cases have been confirmed worldwide, with nearly 2.5 million associated deaths. Within the US alone, there have been over 500,000 deaths and upwards of 28 million reported cases. Governments around the world have implemented or recommended a number of policies to slow the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The pandemic has affected the lives of individuals in countless ways, and though many countries have begun vaccinating their populations, the long-term impact of the virus remains unclear.
The impact of COVID-19 on a given segment of the population appears to vary drastically with that segment's socioeconomic characteristics. In particular, differing rates of infection and fatality have been reported across racial, age, and socioeconomic groups. One of the most important metrics for assessing the impact of the pandemic is the death rate: the proportion of the total population that dies from the disease.
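As a rough illustration of this metric, using the approximate US totals cited above (the population figure is the TotalPopEst2019 value for the US that appears later in the data):

```r
# Approximate US totals: over 500,000 deaths in a population of ~328 million
us_deaths <- 500000
us_pop <- 328239523
round(us_deaths / us_pop * 1e5, 1) # roughly 152 deaths per 100,000 people
```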
We assembled this dataset for our research with the goal of investigating the effectiveness of lockdowns in flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.
There are two main goals for this case study.
Remark: please refer to the most up-to-date version of this write-up.
The data comes from several different sources:
In this case study, we use the following three cleaned datasets:
Across all datasets, the unique county identifier is FIPS.
The cleaning procedure is documented in Appendix 2: Data cleaning. You may go through it if you are interested or would like to make any changes.
First read in the data.
# county-level socialeconomic information
county_data <- fread("data/covid_county.csv")
# county-level COVID case and death
covid_rate <- fread("data/covid_rates.csv")
# county-level lockdown dates
covid_intervention <- fread("data/covid_intervention.csv")
The detailed description of the variables is in Appendix 1: Data description. Please get familiar with the variables, and summarize the data briefly.
head(covid_intervention)
head(county_data)
head(covid_rate)
summary(covid_rate)
The three datasets are county_data, covid_intervention, and covid_rate. The first two have in common that each observation is a US county. county_data is an extremely wide dataset: it has 205 variables describing each county's socioeconomic demographics. Most variables are numeric, but a few categorical variables describe the county's type (e.g., Type_2015_Recreation_NO, whether the county mainly relies on the recreation industry, and RuralUrbanContinuumCode2013, where the county falls on the rural-urban continuum). covid_intervention describes each county's COVID measures; its variables are the dates on which the county suspended and resumed activities such as "gatherings", "public schools", "travel ban", and "entertainment".
The third dataset, covid_rate, is extremely long, because each observation is one county-day: cumulative case and death totals for every county in every US state from January 2020 to February 2021 (for some counties, data are available for only part of the time course). There is also an estimate of each county's total population in 2019.
It is crucial to choose the right granularity for visualization and analysis. We will compare daily versus weekly total new cases by state and see that daily reports are hard to interpret.
covid_rate_sub <- covid_rate %>% filter(State=="New York" | State == "Washington" | State == "Florida")
day_state <- covid_rate_sub %>% group_by(State, date) %>% summarize(cum_cases_state = sum(cum_cases))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
As shown in the R code above, we first created a subset of covid_rate containing only NY, WA, and FL. To obtain a new column of new cases by day by state, we first need to sum the cumulative cases across all counties in a state to get the state's cumulative total. We did this by grouping covid_rate by state and date and then summing.
diff_day <- rep(0, 1112) # 1112 is the number of rows of day_state
daily_increment <- cbind(day_state, diff_day)
## New names:
## * NA -> ...4
split_by_state <- split(daily_increment, daily_increment$State)
split_by_state <- lapply(split_by_state, function(x) transform(x, diff_day= c(NA, diff(cum_cases_state))))
daily_increment <- do.call(rbind, split_by_state)
Then, to get the difference in cum_cases (the cumulative case total) between consecutive days, we first created a column diff_day, filled it with zeros, and appended it to our long dataset daily_increment. We then split the long dataset by State into three shorter datasets, one per state. For each state, we set the first entry of the diff_day column to NA, since we have no data for the date prior to the first day listed, and filled the following entries with the difference between cum_cases in each row and the previous row, using the diff() function. An important assumption is that, within each state's data, covid_rate is already sorted by date. Conveniently, this assumption is satisfied.
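The split/diff/rbind steps above can also be condensed into a single pass; here is a minimal base-R sketch on toy data (the state labels and counts below are hypothetical, not taken from covid_rate):

```r
# Toy cumulative counts for two states over four days
cum <- data.frame(
  State = rep(c("NY", "WA"), each = 4),
  date = rep(as.Date("2020-03-01") + 0:3, times = 2),
  cum_cases_state = c(1, 3, 6, 10, 2, 2, 5, 9)
)
cum <- cum[order(cum$State, cum$date), ] # diff() assumes date-sorted rows
# Day-over-day difference within each state; the first day of each state is NA
cum$diff_day <- ave(cum$cum_cases_state, cum$State,
                    FUN = function(x) c(NA, diff(x)))
cum$diff_day # NA 2 3 4 NA 0 3 4
```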
daily_increment_plot <- daily_increment %>%
ggplot(aes(x = date, y = diff_day, col = State)) +
geom_line() +
geom_point() +
theme_bw() +
ggtitle("Daily New COVID cases in NY, FL and WA")
ggplotly(daily_increment_plot +
theme(legend.position = "none"))
The biggest obstacle to interpreting the daily new-cases graph is that the lines are not smooth: there is too much zigzagging. The visualization presents noise that distracts from the underlying trend. For example, there is a large, notable up-down for Florida between January 1, 2021 and January 2, 2021: the plot shows 0 new cases on January 1 and over 30,000 new cases on January 2. This is probably because new cases were not recorded on January 1, 2021, but the jump is nonetheless distracting.
A separate problem is that the current analysis does not account for the population differences among the three states at all, so we cannot tell how much of Washington's smaller peak is due to its smaller base population.
Create weekly_case_per100k and plot spaghetti plots of weekly_case_per100k by state, using TotalPopEst2019 as the population.
week_state <- covid_rate %>% group_by(State, week) %>% mutate(cum_cases_week_state = sum(cum_cases)) %>% ungroup() %>%
group_by(State) %>% mutate(statePopEst2019 = sum(TotalPopEst2019))%>% ungroup() %>%
select(State,week, cum_cases_week_state, statePopEst2019) %>% distinct(State, week, .keep_all = TRUE)
diff_week <- rep(0, 2518) # 2518 is the number of rows of week_state
weekly_increment <- cbind(week_state, diff_week)
split_by_state <- split(weekly_increment, weekly_increment$State)
split_by_state <- lapply(split_by_state, function(x) transform(x, diff_week= c(NA, diff(cum_cases_week_state))))
weekly_increment <- do.call(rbind, split_by_state)
weekly_increment <- weekly_increment %>% mutate(weekly_new_per100k = diff_week * 100000 / statePopEst2019)
To create the variable weekly_new_per100k, we first obtain the weekly cumulative total for every state. We also sum the 2019 population estimates of all counties in a state to get the state population. Once we have the weekly cumulative totals and population by state, we repeat the process used above for daily new cases to get weekly new cases for every state. The new variable weekly_new_per100k for each state is then weekly new cases divided by the state's total population, multiplied by 100,000.
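As a toy check of this normalization (hypothetical numbers, not taken from the dataset): a state of 7.5 million people reporting 15,000 new cases in one week has 200 weekly new cases per 100k.

```r
# Hypothetical one-week example of the per-100k normalization
diff_week <- 15000 # new cases in the week
statePopEst2019 <- 7500000 # state population
diff_week * 1e5 / statePopEst2019 # 200 new cases per 100k
```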
weekly_increment_plot <- weekly_increment %>%
ggplot(aes(x = week, y = weekly_new_per100k, group = State, col = State)) +
geom_line() +
geom_point() +
theme_bw()+
ggtitle("Weekly New COVID Cases per 100k")+
theme(legend.position = "none")
weekly_increment_plot
We noticed that in the earlier weeks there appeared to be negative weekly new-case counts, probably due to misreporting. To examine the trend more clearly, we exclude all data points showing a negative weekly increase.
weekly_increment_excl <- weekly_increment %>% filter(weekly_new_per100k >= 0)
weekly_increment_plot_excl <- weekly_increment_excl %>%
ggplot(aes(x = week, y = weekly_new_per100k, col = State)) +
geom_line() +
geom_point() +
theme_bw()+
ggtitle("Weekly New COVID Cases per 100k")
ggplotly(weekly_increment_plot_excl +
theme(legend.position = "none"))
summary(covid_intervention)
## FIPS STATE AREA_NAME stay at home
## Min. : 1000 Length:3272 Length:3272 Min. :2020-03-18
## 1st Qu.:19026 Class :character Class :character 1st Qu.:2020-03-25
## Median :30022 Mode :character Mode :character Median :2020-03-29
## Mean :31368 Mean :2020-03-29
## 3rd Qu.:46101 3rd Qu.:2020-04-03
## Max. :72153 Max. :2020-04-08
## NA's :576
## >50 gatherings >500 gatherings public schools
## Min. :2020-03-16 Min. :2020-03-12 Min. :2020-03-16
## 1st Qu.:2020-03-19 1st Qu.:2020-03-16 1st Qu.:2020-03-17
## Median :2020-03-22 Median :2020-03-19 Median :2020-03-18
## Mean :2020-03-22 Mean :2020-03-20 Mean :2020-03-19
## 3rd Qu.:2020-03-26 3rd Qu.:2020-03-26 3rd Qu.:2020-03-20
## Max. :2020-04-03 Max. :2020-04-03 Max. :2020-04-03
## NA's :200 NA's :200
## restaurant dine-in entertainment/gym federal guidelines
## Min. :2020-03-13 Min. :2020-03-13 Min. :2020-03-17
## 1st Qu.:2020-03-18 1st Qu.:2020-03-18 1st Qu.:2020-03-17
## Median :2020-03-20 Median :2020-03-20 Median :2020-03-17
## Mean :2020-03-20 Mean :2020-03-21 Mean :2020-03-17
## 3rd Qu.:2020-03-22 3rd Qu.:2020-03-24 3rd Qu.:2020-03-17
## Max. :2020-04-05 Max. :2020-04-07 Max. :2020-03-17
## NA's :67
## foreign travel ban stay at home rollback >50 gatherings rollback
## Min. :2020-03-12 Min. :2020-04-25 Min. :2020-05-02
## 1st Qu.:2020-03-12 1st Qu.:2020-05-05 1st Qu.:2020-05-16
## Median :2020-03-12 Median :2020-05-16 Median :2020-05-31
## Mean :2020-03-12 Mean :2020-05-17 Mean :2020-05-29
## 3rd Qu.:2020-03-12 3rd Qu.:2020-06-01 3rd Qu.:2020-06-13
## Max. :2020-03-12 Max. :2020-06-23 Max. :2020-07-09
## NA's :604 NA's :2020
## >500 gatherings rollback restaurant dine-in rollback
## Min. :2020-05-02 Min. :2020-04-25
## 1st Qu.:2020-05-17 1st Qu.:2020-05-05
## Median :2020-06-02 Median :2020-05-12
## Mean :2020-05-31 Mean :2020-05-15
## 3rd Qu.:2020-06-13 3rd Qu.:2020-05-19
## Max. :2020-07-09 Max. :2020-07-09
## NA's :2230 NA's :424
## entertainment/gym rollback
## Min. :2020-04-25
## 1st Qu.:2020-05-05
## Median :2020-05-12
## Mean :2020-05-16
## 3rd Qu.:2020-05-22
## Max. :2020-07-09
## NA's :564
The first pattern that attracts our attention is the variability of weekly new cases across time: in some periods the number of new cases grows week over week, while in others it falls. This temporal pattern is shared by every state. There are apparently three peaks, starting around week 12 (April 5-11, 2020), week 25 (July 5-11, 2020), and week 43 (November 8-14, 2020). A peak is where the weekly increase culminates and starts to decrease, so to find possible explanations for the pattern, we mainly ask: 1) what happened 2-3 weeks before each peak to accelerate the growth of new cases? 2) what happened right around each peak to make the curve start to wind down?
The first peak may be the natural culmination of spread since the first detection of the pandemic, before any intervention. Summarizing the first variables in covid_intervention, we notice that the mean start dates of the lockdown measures all fall in the latter half of March. This probably explains why the first peak occurs in early April: as travel and interaction were largely restricted by the first wave of lockdowns, weekly new cases began to decrease.
However, weekly new cases began to rise again after bottoming out in early June (week 21). Many factors may have contributed to this rise: after George Floyd's death on May 26, 2020, demonstrations and protests related to Black Lives Matter peaked in late May and early June in the US. As reported in "Demonstrations & Political Violence in America: New Data for Summer 2020" by the Armed Conflict Location & Event Data Project (ACLED), in the week of June 7, corresponding to our week 21, Black Lives Matter demonstrations hit a peak of more than 3,000 events in a single week. On top of this series of heated civil movements, the summary of covid_intervention shows that most states issued "lockdown rollback" policies in mid-to-late May, which also contributed to the rise in weekly new cases two weeks later.
“Demonstrations & Political Violence in America: New Data for Summer 2020, The Armed Conflict Location & Event Data Project (ACLED).”
Finally, the second peak passed and hit bottom in week 30, presumably as a consequence of another round of lockdown policies as well as the subsiding of Black Lives Matter demonstrations. However, cases quickly rose again toward the third and largest peak. We have no lockdown data that could shed light on the development of the third peak, but we hypothesize that the large number of political gatherings leading up to the presidential election contributed to the sharp growth. The fact that the third wave peaked during election week supports this hypothesis: as the results settled, cases began to wind down. We attribute the subsequent smaller peak on the way down, in week 51, to winter holiday travel.
Beyond the three-peak temporal pattern shared by all states, there is variability among the states as well. For some states, the ups and downs of the curve are small in scale, and weekly new cases have generally been low (e.g., Vermont, Maine, Washington). Looking more closely, the leading states differ across the three peaks. New York is among the highest during the first peak. This makes sense: if the first peak is the culmination of spread through contact and travel without interventions, where would experience more contact and travel than New York? Arizona and Florida lead the second peak; one hypothesis is that hot weather is a contributing factor. South Dakota, North Dakota, and Iowa lead the third peak. Searching the covid_intervention dataset, we found that South Dakota and North Dakota both have NAs for the fields "stay at home", ">50 gatherings", ">500 gatherings", and "entertainment", and Iowa has NA for "stay at home". If NA means these states never issued the corresponding lockdown orders, one hypothesis is that the lack of strict lockdown policies explains why they lead the third peak without being prominent in the previous two waves.
Use covid_intervention to see the effectiveness of lockdown in flattening the curve. (Use the limits argument in scale_fill_gradient.)
pop <- county_data %>%
group_by(State) %>%
summarise(
# county_data includes one state-level row per state; its TotalPopEst2019
# is the state total and its FIPS (e.g. 1000 for AL) is the smallest in the state
total.state.population = max(TotalPopEst2019, na.rm = TRUE),
new.fips = min(FIPS)
) %>%
rename(FIPS = new.fips)
pop <- merge(pop, county_data[,c(1,3)], sort=FALSE, by = "FIPS", all.x=TRUE)
pop <- pop %>%
rename(Abbr = State) %>%
rename(State = County)
head(pop)
## FIPS Abbr total.state.population State
## 1 2000 AK 731545 Alaska
## 2 1000 AL 4903185 Alabama
## 3 5000 AR 3017804 Arkansas
## 4 4000 AZ 7278717 Arizona
## 5 6000 CA 39512223 California
## 6 8000 CO 5758736 Colorado
months <- c("2020-03-31", "2020-04-30", "2020-05-31", "2020-06-30", "2020-07-31", "2020-08-31", "2020-09-30", "2020-10-31", "2020-11-30", "2020-12-31", "2021-01-31", "2021-02-20")
months <- as.Date(months)
temp.covid_rate <- covid_rate %>%
filter(date %in% months)
temp.covid_rate <- temp.covid_rate %>%
group_by(State, date) %>%
summarise(
deaths = sum(cum_deaths)
)
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
temp.covid_rate <- merge(temp.covid_rate, pop, by = "State", all.x = TRUE)
temp.covid_rate$mortality.rate <- (temp.covid_rate$deaths / temp.covid_rate$total.state.population) * 100000
data.s <- temp.covid_rate %>% rename(full=State, state=Abbr)
head(data.s)
## full date deaths FIPS state total.state.population mortality.rate
## 1 Alabama 2020-03-31 14 1000 AL 4903185 0.286
## 2 Alabama 2020-04-30 272 1000 AL 4903185 5.547
## 3 Alabama 2020-05-31 630 1000 AL 4903185 12.849
## 4 Alabama 2020-06-30 950 1000 AL 4903185 19.375
## 5 Alabama 2020-07-31 1580 1000 AL 4903185 32.224
## 6 Alabama 2020-08-31 2182 1000 AL 4903185 44.502
max_mort_col <- quantile(data.s$mortality.rate, 1, na.rm = T)
min_mort_col <- quantile(data.s$mortality.rate, 0, na.rm = T)
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')
)
monthly.plots <- list()
for (i in 1:12) {
data.s.plot <- data.s %>%
filter(date == months[i]) %>%
mutate(hover = paste(state, '<br>',
'mortality rate', round(mortality.rate, 2)))
fig <- plot_geo(data.s.plot, locationmode = 'USA-states')
fig <- fig %>% add_trace(
z = ~mortality.rate, text = ~hover, locations = ~state,
color = ~mortality.rate, colors = 'Blues',
zmin = min_mort_col, zmax = max_mort_col
)
fig <- fig %>% colorbar(title = "Mortality rate of state")
fig <- fig %>% layout(
title = toString(c("State Mortality Rate", toString(months[i]))),
geo = g,
hoverlabel = list(bgcolor="white")
)
monthly.plots[[i]] <- fig
}
ggplotly(monthly.plots[[1]])
ggplotly(monthly.plots[[2]])
ggplotly(monthly.plots[[3]])
ggplotly(monthly.plots[[4]])
ggplotly(monthly.plots[[5]])
ggplotly(monthly.plots[[6]])
ggplotly(monthly.plots[[7]])
ggplotly(monthly.plots[[8]])
ggplotly(monthly.plots[[9]])
ggplotly(monthly.plots[[10]])
ggplotly(monthly.plots[[11]])
ggplotly(monthly.plots[[12]])
Use plotly to animate the monthly maps in i). Does it reveal any systematic way to capture the dynamic changes among states?
data.s$full <- tolower(data.s$full)
data.s <- data.s %>%
rename(region=full)
data.s$center_lat <- data.s$state
data.s$center_long <- data.s$state
states <- map_data("state")
map <- merge(states, data.s, sort=FALSE, by = "region", all.x=TRUE)
map <- map[order(map$order), ]
head(map)
## region long lat group order subregion date deaths FIPS state
## 1 alabama -87.5 30.4 1 1 <NA> 2020-03-31 14 1000 AL
## 2 alabama -87.5 30.4 1 1 <NA> 2020-04-30 272 1000 AL
## 3 alabama -87.5 30.4 1 1 <NA> 2020-05-31 630 1000 AL
## 4 alabama -87.5 30.4 1 1 <NA> 2020-06-30 950 1000 AL
## 5 alabama -87.5 30.4 1 1 <NA> 2020-07-31 1580 1000 AL
## 6 alabama -87.5 30.4 1 1 <NA> 2020-08-31 2182 1000 AL
## total.state.population mortality.rate center_lat center_long
## 1 4903185 0.286 AL AL
## 2 4903185 5.547 AL AL
## 3 4903185 12.849 AL AL
## 4 4903185 19.375 AL AL
## 5 4903185 32.224 AL AL
## 6 4903185 44.502 AL AL
data.s.plot <- data.s %>%
mutate(hover = paste(state, '<br>',
'mortality rate', round(mortality.rate, 2))) %>%
mutate(date = lubridate::month(date))
# Remap calendar months to animation frames: Mar 2020 (3) -> 1, ..., Dec 2020 (12) -> 10, Jan/Feb 2021 (1, 2) -> 11, 12
data.s.plot$date <- ifelse(data.s.plot$date %in% c(1,2), data.s.plot$date + 10, data.s.plot$date - 2)
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')
)
fig <-plot_geo(data.s.plot, locationmode = 'USA-states')
fig <- fig %>% add_trace(
z = ~mortality.rate, text = ~hover, locations = ~state, frame= ~date,
color = ~mortality.rate, colors = 'Blues',
zmin = min_mort_col, zmax = max_mort_col
)
fig <- fig %>% colorbar(title = "Mortality rate of state")
fig <- fig %>% layout(
title = 'Cumulative Mortality Rate of State: March 2020 - Feb 2021',
geo = g,
hoverlabel = list(bgcolor="white")
) %>%
animation_slider(
currentvalue = list(prefix = "Month: ")
)
fig
We now try to build a good, parsimonious model to find possible factors related to the death rate at the county level. Let us not take the time series into account for the moment, and use the totals as of February 1, 2021.
Create total_death_per100k as the total number of COVID deaths per 100k as of Feb 1, 2021. We suggest taking the log transformation log_total_death_per100k = log(total_death_per100k + 1). Merge total_death_per100k into county_data for the following analysis.
covid_factor <- covid_rate[which(covid_rate$date == "2021-02-01")]
covid_factor$total_death_per100k <- (covid_factor$cum_deaths / covid_factor$TotalPopEst2019) * 100000
covid_factor$log_total_death_per100k <- log(covid_factor$total_death_per100k + 1)
covid_factor <- covid_factor[, c(1,5,6,7,10)]
covid_factor <- covid_factor %>% merge(county_data, by = "FIPS", all=TRUE)
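The +1 offset in the log transform matters because counties that recorded zero deaths would otherwise map to log(0) = -Inf; a minimal check:

```r
# With the offset, zero-death counties map to 0 instead of -Inf
log(0 + 1) # 0
log(99 + 1) # log(100), about 4.6
```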
length(unique(county_data$FIPS))
## [1] 3278
length(unique(covid_rate$FIPS))
## [1] 3108
length(unique(covid_factor$FIPS))
## [1] 3278
#all NAs are either total state/country level information, low-population counties in Alaska, or Puerto Rico
covid_factor[which(is.na(covid_factor$log_total_death_per100k)), c(1,5,6,7,18)]
## FIPS log_total_death_per100k State County TotalPopEst2019
## 1: 0 NA US United States 328239523
## 2: 1000 NA AL Alabama 4903185
## 3: 2000 NA AK Alaska 731545
## 4: 2010 NA AK Aleutian Islands NA
## 5: 2013 NA AK Aleutians East 3337
## ---
## 167: 72147 NA PR Vieques 8386
## 168: 72149 NA PR Villalba 21372
## 169: 72151 NA PR Yabucoa 32282
## 170: 72153 NA PR Yauco 33575
## 171: 72153 NA PR Yauco 33860
#remove all NAs
covid_factor <- covid_factor %>%
filter(!is.na(log_total_death_per100k))
head(covid_factor)
## FIPS cum_cases cum_deaths week log_total_death_per100k State County
## 1: 1001 5683 69 55 4.82 AL Autauga
## 2: 1003 18211 224 55 4.62 AL Baldwin
## 3: 1005 1956 40 55 5.09 AL Barbour
## 4: 1007 2309 52 55 5.45 AL Bibb
## 5: 1009 5720 100 55 5.16 AL Blount
## 6: 1011 1089 29 55 5.66 AL Bullock
## MedHHInc PerCapitaInc PovertyUnder18Pct PovertyAllAgesPct Deep_Pov_All
## 1: 59338 29372 19.3 13.8 6.14
## 2: 57588 31203 13.9 9.8 4.48
## 3: 34382 18461 43.9 30.9 12.75
## 4: 46064 20199 27.8 21.8 5.90
## 5: 50412 22656 18.0 13.2 7.31
## 6: 29267 20346 68.3 42.5 17.66
## Deep_Pov_Children PovertyUnder18Num PovertyAllAgesNum PopChangeRate1819
## 1: 8.91 2509 7587 0.605
## 2: 6.21 6442 21069 2.469
## 3: 26.71 2242 6788 -0.748
## 4: 9.26 1238 4400 0.121
## 5: 12.33 2374 7527 0.095
## 6: 44.64 1433 3610 -0.718
## PopChangeRate1019 TotalPopEst2019 NetMigrationRate1019 NaturalChangeRate1019
## 1: 2.001 55869 0.686 1.315
## 2: 21.911 223234 21.001 0.910
## 3: -9.664 24686 -8.797 -0.867
## 4: -2.081 22394 -2.099 0.017
## 5: 0.784 57826 -0.005 0.790
## 6: -7.126 10101 -7.769 0.644
## Net_International_Migration_Rate_2010_2019 PopChangeRate0010
## 1: -0.029 24.96
## 2: 0.714 29.80
## 3: 0.161 -5.44
## 4: 0.525 10.03
## 5: 0.207 12.34
## 6: 0.699 -6.83
## NetMigrationRate0010 NaturalChangeRate0010 Immigration_Rate_2000_2010
## 1: 11.87 5.46 -0.0102
## 2: 26.17 3.32 1.5845
## 3: -4.80 2.29 1.8284
## 4: 6.43 2.10 0.3415
## 5: 12.27 3.14 2.0770
## 6: -9.13 3.20 3.6612
## PopDensity2010 Under18Pct2010 Age65AndOlderPct2010 WhiteNonHispanicPct2010
## 1: 91.8 26.8 12.0 77.2
## 2: 114.7 23.0 16.8 83.5
## 3: 31.0 21.9 14.2 46.8
## 4: 36.8 22.7 12.7 75.0
## 5: 88.9 24.6 14.7 88.9
## 6: 17.5 22.3 13.5 21.9
## BlackNonHispanicPct2010 AsianNonHispanicPct2010
## 1: 17.58 0.86
## 2: 9.31 0.74
## 3: 46.69 0.39
## 4: 21.92 0.10
## 5: 1.26 0.20
## 6: 69.97 0.18
## NativeAmericanNonHispanicPct2010 HispanicPct2010 MultipleRacePct2010
## 1: 0.40 2.40 1.59
## 2: 0.63 4.38 1.49
## 3: 0.22 5.05 0.94
## 4: 0.28 1.77 0.89
## 5: 0.50 8.07 1.19
## 6: 0.18 7.12 0.79
## NonHispanicWhitePopChangeRate0010 NonHispanicBlackPopChangeRate0010
## 1: 21.05 29.17
## 2: 25.92 18.17
## 3: -13.19 -4.11
## 4: 8.32 9.60
## 5: 8.41 21.07
## 6: -13.46 -10.00
## NonHispanicAsianPopChangeRate0010 NonHispanicNativeAmericanPopChangeRate0010
## 1: 140.72 16.7
## 2: 152.35 52.2
## 3: 28.92 -49.6
## 4: 29.41 39.1
## 5: 64.29 34.4
## 6: -4.76 -46.0
## HispanicPopChangeRate0010 MultipleRacePopChangeRate0010
## 1: 114.8 114.57
## 2: 224.1 85.74
## 3: 190.2 21.70
## 4: 93.3 89.72
## 5: 70.2 31.79
## 6: 141.3 4.88
## WhiteNonHispanicNum2010 BlackNonHispanicNum2010 AsianNonHispanicNum2010
## 1: 42154 9595 467
## 2: 152200 16966 1340
## 3: 12837 12820 107
## 4: 17191 5024 22
## 5: 50952 724 115
## 6: 2392 7637 20
## NativeAmericanNonHispanicNum2010 HispanicNum2010 MultipleRaceNum2010
## 1: 217 1310 869
## 2: 1146 7992 2723
## 3: 60 1387 258
## 4: 64 406 203
## 5: 285 4626 684
## 6: 20 777 86
## ForeignBornPct ForeignBornEuropePct ForeignBornMexPct NonEnglishHHPct
## 1: 2.018 0.3351 0.346 0.654
## 2: 3.445 0.7717 0.820 1.425
## 3: 2.506 0.2405 1.299 1.470
## 4: 1.443 0.0355 0.169 0.687
## 5: 4.359 0.2897 3.131 1.461
## 6: 0.927 0.1739 0.000 1.053
## Ed1LessThanHSPct Ed2HSDiplomaOnlyPct Ed3SomeCollegePct Ed4AssocDegreePct
## 1: 11.31 32.6 20.3 8.07
## 2: 9.74 27.6 22.0 9.36
## 3: 26.97 35.7 18.1 7.04
## 4: 16.79 47.3 18.6 5.75
## 5: 19.84 34.0 21.4 12.05
## 6: 24.77 39.7 17.0 5.31
## Ed5CollegePlusPct AvgHHSize FemaleHHPct HH65PlusAlonePct OwnHomePct
## 1: 27.7 2.59 11.39 10.5 74.9
## 2: 31.3 2.61 9.52 12.8 73.6
## 3: 12.2 2.49 19.33 14.5 61.4
## 4: 11.5 2.99 13.65 12.9 75.1
## 5: 12.6 2.77 9.60 11.0 78.6
## 6: 13.3 2.75 24.47 10.8 75.5
## ForeignBornNum TotalPopACS ForeignBornAfricaPct Ed3SomeCollegeNum
## 1: 1114 55200 0.0000 7554
## 2: 7170 208107 0.0913 32266
## 3: 646 25782 0.0000 3287
## 4: 325 22527 0.0000 2938
## 5: 2513 57645 0.1145 8492
## 6: 96 10352 0.1449 1205
## Ed2HSDiplomaOnlyNum Ed1LessThanHSNum TotalPop25Plus Ed5CollegePlusNum
## 1: 12119 4204 37166 10291
## 2: 40579 14310 146989 46075
## 3: 6486 4901 18173 2220
## 4: 7471 2650 15780 1813
## 5: 13489 7861 39627 5010
## 6: 2817 1760 7104 945
## TotalOccHU ForeignBornAsiaPct Ed4AssocDegreeNum ForeignBornEuropeNum
## 1: 21115 0.884 2998 185
## 2: 78622 0.559 13759 1606
## 3: 9186 0.310 1279 62
## 4: 6840 0.226 908 8
## 5: 20600 0.241 4775 167
## 6: 3609 0.222 377 18
## NonEnglishHHNum HH65PlusAloneNum OwnHomeNum FemaleHHNum TotalHH
## 1: 138 2226 15814 2405 21115
## 2: 1120 10093 57881 7485 78622
## 3: 135 1333 5640 1776 9186
## 4: 47 885 5135 934 6840
## 5: 301 2276 16197 1977 20600
## 6: 38 388 2726 883 3609
## ForeignBornCentralSouthAmPct ForeignBornCentralSouthAmNum
## 1: 0.636 351
## 2: 1.518 3160
## 3: 1.870 482
## 4: 1.057 238
## 5: 3.528 2034
## 6: 0.319 33
## ForeignBornCaribPct ForeignBornCaribNum ForeignBornAfricaNum
## 1: 0.1196 66 0
## 2: 0.3681 766 190
## 3: 0.0116 3 0
## 4: 0.1243 28 0
## 5: 0.1041 60 66
## 6: 0.0676 7 15
## ForeignBornAsiaNum ForeignBornMexNum LandAreaSQMiles2010
## 1: 488 191 594
## 2: 1164 1707 1590
## 3: 80 335 885
## 4: 51 38 623
## 5: 139 1805 645
## 6: 23 0 623
## Age65AndOlderNum2010 TotalPop2010 Under18Num2010
## 1: 6546 54571 14613
## 2: 30568 182265 41898
## 3: 3909 27457 6015
## 4: 2906 22915 5201
## 5: 8439 57322 14106
## 6: 1469 10914 2430
## Net_International_Migration_2000_2010 NaturalChangeNum0010
## 1: -4.5 2404
## 2: 2239.5 4686
## 3: 530.5 664
## 4: 68.0 418
## 5: 1061.5 1606
## 6: 424.0 371
## NetMigrationNum0010 TotalPopEst2012 TotalPopEst2013 TotalPopEst2010
## 1: 5224 54954 54727 54773
## 2: 36984 190145 194885 183112
## 3: -1393 27169 26937 27327
## 4: 1280 22667 22521 22870
## 5: 6268 57580 57619 57376
## 6: -1058 10606 10549 10876
## TotalPopEst2014 TotalPopEst2011 Net_International_Migration_2010_2019
## 1: 54893 55227 -16
## 2: 199183 186558 1307
## 3: 26755 27341 44
## 4: 22553 22745 120
## 5: 57526 57560 119
## 6: 10663 10675 76
## NaturalChange1019 TotalPopEst2015 TotalPopEst2016 TotalPopEst2017
## 1: 720 54864 55243 55390
## 2: 1667 202939 207601 212521
## 3: -237 26283 25806 25157
## 4: 4 22566 22586 22550
## 5: 453 57526 57494 57787
## 6: 70 10400 10389 10176
## NetMigration1019 TotalPopEst2018 TotalPopEstBase2010 UnempRate2019
## 1: 376 55533 54597 2.7
## 2: 38455 217855 182265 2.7
## 3: -2404 24872 27455 3.8
## 4: -480 22367 22915 3.1
## 5: -3 57771 57322 2.7
## 6: -845 10174 10911 3.6
## UnempRate2018 UnempRate2017 UnempRate2016 UnempRate2015 UnempRate2014
## 1: 3.6 3.9 5.1 5.2 5.8
## 2: 3.6 4.1 5.3 5.5 6.1
## 3: 5.1 5.8 8.3 8.9 10.5
## 4: 3.9 4.4 6.4 6.6 7.2
## 5: 3.5 4.0 5.4 5.4 6.1
## 6: 4.6 4.9 6.8 7.9 8.8
## UnempRate2010 UnempRate2007 PctEmpChange1019 PctEmpChange1819
## 1: 8.9 3.3 8.65 0.78
## 2: 10.0 3.1 26.03 3.12
## 3: 12.3 6.3 -8.33 2.83
## 4: 11.4 4.1 6.38 1.83
## 5: 9.8 3.2 9.77 1.88
## 6: 11.8 9.4 6.59 1.58
## PctEmpChange0719 PctEmpChange0710 PctEmpAgriculture PctEmpMining
## 1: 7.98 -0.62 0.551 0.332
## 2: 18.20 -6.22 0.956 0.372
## 3: -15.19 -7.49 4.404 0.115
## 4: -0.15 -6.14 1.901 4.087
## 5: -4.36 -12.88 1.485 0.796
## 6: 40.36 31.68 5.939 0.000
## PctEmpConstruction PctEmpManufacturing PctEmpTrade PctEmpTrans
## 1: 6.00 12.95 11.65 6.98
## 2: 8.27 9.12 16.80 4.93
## 3: 6.25 23.89 13.41 6.92
## 4: 9.01 18.71 10.87 4.89
## 5: 9.46 16.41 15.36 6.86
## 6: 4.92 33.76 7.41 5.20
## PctEmpInformation PctEmpFIRE PctEmpServices PctEmpGovt NumCivEmployed
## 1: 1.629 6.26 43.4 10.25 24124
## 2: 1.370 7.40 45.7 5.07 93379
## 3: 0.241 3.97 33.7 7.10 8720
## 4: 1.309 4.58 39.4 5.27 8099
## 5: 1.251 5.84 38.0 4.59 21346
## 6: 0.431 2.31 33.5 6.55 3940
## UnempRate2011 NumCivLaborForce2011 NumEmployed2011 NumCivLaborForce2012
## 1: 8.4 25836 23677 25740
## 2: 9.0 85045 77418 84414
## 3: 11.5 9849 8712 9362
## 4: 10.5 8933 7996 8798
## 5: 8.7 25123 22939 24960
## 6: 11.6 4833 4272 4736
## NumUnemployed2010 NumCivLaborForce2008 NumUnemployed2011 NumEmployed2010
## 1: 2282 24687 2159 23431
## 2: 8339 83223 7627 75120
## 3: 1262 10161 1137 8959
## 4: 1020 8749 937 7914
## 5: 2446 26698 2184 22460
## 6: 585 3634 561 4356
## NumCivLaborForce2010 NumUnemployed2009 NumEmployed2009 NumCivLaborForce2009
## 1: 25713 2402 22301 24703
## 2: 83459 8048 74403 82451
## 3: 10221 1431 8572 10003
## 4: 8934 1161 7581 8742
## 5: 24906 2648 23832 26480
## 6: 4941 583 3156 3739
## UnempRate2008 UnempRate2012 NumEmployed2008 UnempRate2009 NumUnemployed2008
## 1: 5.1 6.9 23420 9.7 1267
## 2: 4.6 7.5 79372 9.8 3851
## 3: 8.8 11.5 9267 14.3 894
## 4: 5.8 8.5 8241 13.3 508
## 5: 4.7 6.9 25453 10.0 1245
## 6: 10.5 10.4 3251 15.6 383
## NumUnemployed2015 NumUnemployed2019 NumCivLaborforce2019 NumUnemployed2018
## 1: 1331 714 26172 935
## 2: 4862 2653 97328 3424
## 3: 765 324 8537 427
## 4: 568 266 8685 337
## 5: 1323 676 25331 868
## 6: 379 175 4818 220
## NumEmployed2018 NumCivLaborforce2018 NumUnemployed2017 NumEmployed2017
## 1: 25261 26196 1013 25062
## 2: 91809 95233 3745 88711
## 3: 7987 8414 486 7863
## 4: 8268 8605 375 8208
## 5: 24201 25069 998 23824
## 6: 4571 4791 238 4614
## NumCivLaborforce2017 NumUnemployed2016 NumEmployed2016 NumCivLaborforce2016
## 1: 26075 1322 24709 26031
## 2: 92456 4835 86060 90895
## 3: 8349 700 7736 8436
## 4: 8583 556 8088 8644
## 5: 24822 1326 23358 24684
## 6: 4852 330 4514 4844
## NumCivLaborforce2013 NumEmployed2015 NumEmployed2012 NumUnemployed2014
## 1: 25810 24321 23961 1495
## 2: 85280 83010 78065 5301
## 3: 9099 7860 8283 932
## 4: 8705 8021 8047 617
## 5: 24887 23198 23244 1504
## 6: 4778 4406 4245 418
## NumEmployed2014 NumCivLaborforce2014 UnempRate2013 NumUnemployed2013
## 1: 24097 25592 6.2 1605
## 2: 81083 86384 6.6 5654
## 3: 7913 8845 10.2 931
## 4: 7942 8559 7.9 689
## 5: 23023 24527 6.3 1562
## 6: 4314 4732 9.4 447
## NumEmployed2019 NumEmployed2013 NumUnemployed2007 NumEmployed2007
## 1: 25458 24205 806 23577
## 2: 94675 79626 2560 80099
## 3: 8213 8168 650 9684
## 4: 8419 8016 359 8432
## 5: 24655 23325 849 25780
## 6: 4643 4331 345 3308
## NumCivLaborforce2007 NumUnemployed2012 NumCivLaborforce2015
## 1: 24383 1779 25652
## 2: 82659 6349 87872
## 3: 10334 1079 8625
## 4: 8791 751 8589
## 5: 26629 1716 24521
## 6: 3653 491 4785
## RuralUrbanContinuumCode2013 UrbanInfluenceCode2013
## 1: 2 2
## 2: 3 2
## 3: 6 6
## 4: 1 1
## 5: 1 1
## 6: 6 6
## RuralUrbanContinuumCode2003 UrbanInfluenceCode2003 Metro2013 Nonmetro2013
## 1: 2 2 1 0
## 2: 4 5 1 0
## 3: 6 6 0 1
## 4: 1 1 1 0
## 5: 1 1 1 0
## 6: 6 6 0 1
## Micropolitan2013 Type_2015_Update Type_2015_Farming_NO
## 1: 0 0 0
## 2: 0 5 0
## 3: 0 3 0
## 4: 0 0 0
## 5: 0 0 0
## 6: 0 3 0
## Type_2015_Manufacturing_NO Type_2015_Mining_NO Type_2015_Government_NO
## 1: 0 0 0
## 2: 0 0 0
## 3: 1 0 0
## 4: 0 0 0
## 5: 0 0 0
## 6: 1 0 0
## Type_2015_Recreation_NO Low_Education_2015_update Low_Employment_2015_update
## 1: 0 0 0
## 2: 1 0 0
## 3: 0 1 1
## 4: 0 1 1
## 5: 0 1 1
## 6: 0 1 1
## Population_loss_2015_update Retirement_Destination_2015_Update
## 1: 0 1
## 2: 0 1
## 3: 0 0
## 4: 0 0
## 5: 0 0
## 6: 0 0
## Perpov_1980_0711 PersistentChildPoverty_1980_2011 Hipov HiAmenity
## 1: 0 0 0 0
## 2: 0 0 0 1
## 3: 1 1 1 0
## 4: 0 1 0 0
## 5: 0 0 0 0
## 6: 1 1 1 0
## HiCreativeClass2000 Gas_Change Oil_Change Oil_Gas_Change Metro2003
## 1: 1 0 0 0 1
## 2: 1 9 0 9 0
## 3: 0 0 0 0 0
## 4: 0 0 0 0 1
## 5: 0 0 0 0 1
## 6: 0 0 0 0 0
## NonmetroNotAdj2003 NonmetroAdj2003 Noncore2003 EconomicDependence2000
## 1: 0 0 0 3
## 2: 0 1 0 5
## 3: 0 1 1 3
## 4: 0 0 0 6
## 5: 0 0 0 6
## 6: 0 1 1 1
## Nonmetro2003 Micropolitan2003 FarmDependent2003 ManufacturingDependent2000
## 1: 0 0 0 1
## 2: 1 1 0 0
## 3: 1 0 0 1
## 4: 0 0 0 0
## 5: 0 0 0 0
## 6: 1 0 1 0
## LowEducation2000 RetirementDestination2000 PersistentPoverty2000 Noncore2013
## 1: 0 0 0 0
## 2: 0 1 0 0
## 3: 1 0 1 1
## 4: 1 1 1 0
## 5: 0 1 0 0
## 6: 1 0 1 1
## Type_2015_Nonspecialized_NO Metro_Adjacent2013 PersistentChildPoverty2004
## 1: 1 0 0
## 2: 0 0 0
## 3: 0 1 1
## 4: 1 0 1
## 5: 1 0 0
## 6: 0 1 1
## RecreationDependent2000
## 1: 0
## 2: 1
## 3: 0
## 4: 0
## 5: 0
## 6: 0
Select possible variables in county_data as covariates. We provide county_data_sub, a subset of variables from county_data, to get you started. Please add any other potentially useful variables as you wish.
Report missing values in your final subset of variables.
In the following analysis, you may ignore the missing values.
county_data_sub <- county_data %>%
  select(County, State, FIPS, Deep_Pov_All, PovertyAllAgesPct, PerCapitaInc,
         UnempRate2019, PctEmpFIRE, PctEmpConstruction, PctEmpTrans,
         PctEmpMining, PctEmpTrade, PctEmpInformation, PctEmpAgriculture,
         PctEmpManufacturing, PctEmpServices, PopDensity2010, OwnHomePct,
         Age65AndOlderPct2010, TotalPop25Plus, Under18Pct2010,
         Ed2HSDiplomaOnlyPct, Ed3SomeCollegePct, Ed4AssocDegreePct,
         Ed5CollegePlusPct, ForeignBornPct,
         Net_International_Migration_Rate_2010_2019, NetMigrationRate1019,
         NaturalChangeRate1019, TotalPopEst2019, WhiteNonHispanicPct2010,
         NativeAmericanNonHispanicPct2010, BlackNonHispanicPct2010,
         AsianNonHispanicPct2010, HispanicPct2010, Type_2015_Update,
         RuralUrbanContinuumCode2013, UrbanInfluenceCode2013, Perpov_1980_0711,
         HiCreativeClass2000, HiAmenity, Retirement_Destination_2015_Update)
library(leaps)
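To report the missing values, one simple idiom is colSums(is.na(...)); a sketch on a toy table is below, and the same call applies unchanged to county_data_sub:

```r
library(data.table)

# Per-column missing counts, demonstrated on a toy table; run the same
# colSums(is.na(...)) call on county_data_sub to report its missing values
toy <- data.table(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]  # a and b each have 1 missing value
```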
# response: log of total deaths per 100k; forward selection
fit.forward <- regsubsets(log_total_death_per100k ~ ., covid_factor,
                          nvmax = 25, method = "forward")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 27 linear dependencies found
## Reordering variables and trying again:
f.f <- summary(fit.forward)
fit.forward.var <- f.f$which
colnames(fit.forward.var)[fit.forward.var[5, ]]
## [1] "(Intercept)"         "StateKY"             "StateVT"
## [4] "MultipleRacePct2010" "Ed5CollegePlusPct"   "FemaleHHPct"
# Selected: (Intercept), StateKY, StateVT, MultipleRacePct2010,
# Ed5CollegePlusPct, FemaleHHPct
fit <- lm(log_total_death_per100k ~ State + Ed5CollegePlusPct + FemaleHHPct,
          covid_factor)
summary(fit)
##
## Call:
## lm(formula = log_total_death_per100k ~ State + Ed5CollegePlusPct +
## FemaleHHPct, data = covid_factor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.656 -0.253 0.083 0.400 2.721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.86584 0.12813 37.98 < 2e-16 ***
## StateAR 0.01516 0.13944 0.11 0.91345
## StateAZ 0.28936 0.23677 1.22 0.22177
## StateCA -0.89151 0.14980 -5.95 3.0e-09 ***
## StateCO -0.67131 0.14867 -4.52 6.6e-06 ***
## StateCT 0.34313 0.31184 1.10 0.27126
## StateDC 0.32143 0.83762 0.38 0.70120
## StateDE -0.19134 0.48924 -0.39 0.69576
## StateFL -0.13936 0.14362 -0.97 0.33193
## StateGA -0.11282 0.12071 -0.93 0.35006
## StateIA 0.35833 0.13437 2.67 0.00770 **
## StateID -0.62016 0.16311 -3.80 0.00015 ***
## StateIL 0.23992 0.13173 1.82 0.06865 .
## StateIN 0.00495 0.13437 0.04 0.97060
## StateKS -0.67912 0.13297 -5.11 3.5e-07 ***
## StateKY -0.79609 0.12706 -6.27 4.2e-10 ***
## StateLA 0.09333 0.14490 0.64 0.51955
## StateMA 0.03595 0.24698 0.15 0.88428
## StateMD -0.21210 0.19871 -1.07 0.28588
## StateME -1.46958 0.23209 -6.33 2.8e-10 ***
## StateMI -0.10709 0.13792 -0.78 0.43752
## StateMN -0.22115 0.13777 -1.61 0.10855
## StateMO -0.37610 0.12905 -2.91 0.00359 **
## StateMS 0.11317 0.13743 0.82 0.41031
## StateMT -0.02417 0.15380 -0.16 0.87511
## StateNC -0.49066 0.13106 -3.74 0.00018 ***
## StateND 0.36006 0.15637 2.30 0.02137 *
## StateNE -0.60296 0.13693 -4.40 1.1e-05 ***
## StateNH -0.84115 0.28290 -2.97 0.00297 **
## StateNJ 0.56256 0.20986 2.68 0.00739 **
## StateNM -0.37707 0.17658 -2.14 0.03280 *
## StateNV -0.91057 0.22628 -4.02 5.9e-05 ***
## StateNY -0.25719 0.14752 -1.74 0.08137 .
## StateOH -0.41665 0.13528 -3.08 0.00209 **
## StateOK -0.46489 0.13921 -3.34 0.00085 ***
## StateOR -1.17559 0.17266 -6.81 1.2e-11 ***
## StatePA 0.21858 0.14470 1.51 0.13100
## StateRI -0.01702 0.38616 -0.04 0.96486
## StateSC -0.09686 0.15887 -0.61 0.54211
## StateSD 0.47093 0.14554 3.24 0.00123 **
## StateTN 0.03739 0.13278 0.28 0.77828
## StateTX 0.09653 0.11443 0.84 0.39896
## StateUT -1.20255 0.18672 -6.44 1.4e-10 ***
## StateVA -0.56371 0.12533 -4.50 7.1e-06 ***
## StateVT -2.23193 0.24585 -9.08 < 2e-16 ***
## StateWA -1.18192 0.16862 -7.01 2.9e-12 ***
## StateWI -0.12905 0.14332 -0.90 0.36794
## StateWV -0.66306 0.15187 -4.37 1.3e-05 ***
## StateWY -0.22907 0.20306 -1.13 0.25935
## Ed5CollegePlusPct -0.01579 0.00181 -8.74 < 2e-16 ***
## FemaleHHPct 0.04162 0.00448 9.29 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.828 on 3057 degrees of freedom
## Multiple R-squared: 0.292, Adjusted R-squared: 0.28
## F-statistic: 25.2 on 50 and 3057 DF, p-value: <2e-16
covid_factor[which(is.na(covid_factor$Ed5CollegePlusPct))]
## Empty data.table (0 rows and 212 cols): FIPS,cum_cases,cum_deaths,week,log_total_death_per100k,State...
covid_factor[which(is.na(covid_factor$State))]
## Empty data.table (0 rows and 212 cols): FIPS,cum_cases,cum_deaths,week,log_total_death_per100k,State...
covid_factor[which(is.na(covid_factor$FemaleHHPct))]
## Empty data.table (0 rows and 212 cols): FIPS,cum_cases,cum_deaths,week,log_total_death_per100k,State...
# There are no missing values in our final selection of variables
Use LASSO to choose a parsimonious model with all available sensible county-level information. Force in State in the process. Why do we need to force in State?
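A minimal glmnet sketch of this step, assuming the design matrix is built from covid_factor as above; setting penalty.factor to 0 on the State dummies is one way to force them in, since unpenalized coefficients are never shrunk out of the solution:

```r
library(glmnet)

# Build the design matrix (drop the intercept column)
X <- model.matrix(log_total_death_per100k ~ ., data = covid_factor)[, -1]
y <- covid_factor$log_total_death_per100k

# Zero penalty on the State dummies forces them into every LASSO solution
pf <- ifelse(grepl("^State", colnames(X)), 0, 1)

fit.lasso <- cv.glmnet(X, y, alpha = 1, penalty.factor = pf)
coef(fit.lasso, s = "lambda.1se")  # sparse coefficient vector
```

Forcing State in keeps all the state dummies together as one categorical control, rather than letting LASSO keep some states and drop others arbitrarily.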
Use Cp or BIC to fine tune the LASSO model from iii). Again force in State in the process.
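One way to fine-tune, sketched under the assumption that a data set covid_lasso_sub (a hypothetical name) holds the response, State, and the LASSO-selected covariates; regsubsets() also has a force.in argument if you want to force the State dummies in at this stage as well:

```r
library(leaps)

# Search over the LASSO-selected variables; BIC (or Cp, via f.s$cp)
# picks the model size
fit.sub <- regsubsets(log_total_death_per100k ~ ., data = covid_lasso_sub,
                      nvmax = 20, method = "backward")
f.s <- summary(fit.sub)
which.min(f.s$bic)               # model size with the smallest BIC
coef(fit.sub, which.min(f.s$bic))
```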
If necessary, reduce the model from iv) to a final model with all variables significant at the 0.05 level. Are the linear model assumptions all reasonably met?
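The usual residual diagnostics can be checked with the built-in plot method for lm objects (fit.final below is a placeholder name for your final model):

```r
# Residual diagnostics for the final model
par(mfrow = c(1, 2))
plot(fit.final, which = 1)  # residuals vs. fitted: linearity, equal variance
plot(fit.final, which = 2)  # normal Q-Q: normality of residuals
```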
It has been shown that COVID affects the elderly the most. It has also been claimed that the COVID death rate is higher among African Americans and Latinxs. Does your analysis support these claims?
Based on your final model, summarize your findings. In particular, summarize the state effect, controlling for the other variables. Provide intervention recommendations to policy makers to reduce the COVID death rate.
What else can we do to improve our model? What other important information may we have missed?
A detailed summary of the variables in each data set follows:
Infection and fatality data
Socioeconomic demographics
Income: Poverty level and household income
Jobs: Employment type, rate, and change
UnempRate2007-2019: Unemployment rate, 2007-2019
NumEmployed2007-2019: Employed, 2007-2019
NumUnemployed2007-2019: Unemployed, 2007-2019
PctEmpChange1019: Percent employment change, 2010-19
PctEmpChange1819: Percent employment change, 2018-19
PctEmpChange0719: Percent employment change, 2007-19
PctEmpChange0710: Percent employment change, 2007-10
NumCivEmployed: Civilian employed population 16 years and over, 2014-18
NumCivLaborforce2007-2019: Civilian labor force, 2007-2019
PctEmpFIRE: Percent of the civilian labor force 16 and over employed in finance and insurance, and real estate and rental and leasing, 2014-18
PctEmpConstruction: Percent of the civilian labor force 16 and over employed in construction, 2014-18
PctEmpTrans: Percent of the civilian labor force 16 and over employed in transportation, warehousing and utilities, 2014-18
PctEmpMining: Percent of the civilian labor force 16 and over employed in mining, quarrying, oil and gas extraction, 2014-18
PctEmpTrade: Percent of the civilian labor force 16 and over employed in wholesale and retail trade, 2014-18
PctEmpInformation: Percent of the civilian labor force 16 and over employed in information services, 2014-18
PctEmpAgriculture: Percent of the civilian labor force 16 and over employed in agriculture, forestry, fishing, and hunting, 2014-18
PctEmpManufacturing: Percent of the civilian labor force 16 and over employed in manufacturing, 2014-18
PctEmpServices: Percent of the civilian labor force 16 and over employed in services, 2014-18
PctEmpGovt: Percent of the civilian labor force 16 and over employed in public administration, 2014-18
People: Population size, density, education level, race, age, household size, and migration rates
PopDensity2010: Population density, 2010
LandAreaSQMiles2010: Land area in square miles, 2010
TotalHH: Total number of households, 2014-18
TotalOccHU: Total number of occupied housing units, 2014-18
AvgHHSize: Average household size, 2014-18
OwnHomeNum: Number of owner occupied housing units, 2014-18
OwnHomePct: Percent of owner occupied housing units, 2014-18
NonEnglishHHPct: Percent of non-English speaking households of total households, 2014-18
HH65PlusAlonePct: Percent of persons 65 or older living alone, 2014-18
FemaleHHPct: Percent of female headed family households of total households, 2014-18
FemaleHHNum: Number of female headed family households, 2014-18
NonEnglishHHNum: Number of non-English speaking households, 2014-18
HH65PlusAloneNum: Number of persons 65 years or older living alone, 2014-18
Age65AndOlderPct2010: Percent of population 65 or older, 2010
Age65AndOlderNum2010: Population 65 years or older, 2010
TotalPop25Plus: Total population 25 and older, 2014-18 - 5-year average
Under18Pct2010: Percent of population under age 18, 2010
Under18Num2010: Population under age 18, 2010
Ed1LessThanHSPct: Percent of persons with no high school diploma or GED, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyPct: Percent of persons with a high school diploma or GED only, adults 25 and over, 2014-18
Ed3SomeCollegePct: Percent of persons with some college experience, adults 25 and over, 2014-18
Ed4AssocDegreePct: Percent of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusPct: Percent of persons with a 4-year college degree or more, adults 25 and over, 2014-18
Ed1LessThanHSNum: No high school, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyNum: High school only, adults 25 and over, 2014-18
Ed3SomeCollegeNum: Some college experience, adults 25 and over, 2014-18
Ed4AssocDegreeNum: Number of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusNum: College degree 4-years or more, adults 25 and over, 2014-18
ForeignBornPct: Percent of total population foreign born, 2014-18
ForeignBornEuropePct: Percent of persons born in Europe, 2014-18
ForeignBornMexPct: Percent of persons born in Mexico, 2014-18
ForeignBornCentralSouthAmPct: Percent of persons born in Central or South America, 2014-18
ForeignBornAsiaPct: Percent of persons born in Asia, 2014-18
ForeignBornCaribPct: Percent of persons born in the Caribbean, 2014-18
ForeignBornAfricaPct: Percent of persons born in Africa, 2014-18
ForeignBornNum: Number of people foreign born, 2014-18
ForeignBornCentralSouthAmNum: Number of persons born in Central or South America, 2014-18
ForeignBornEuropeNum: Number of persons born in Europe, 2014-18
ForeignBornMexNum: Number of persons born in Mexico, 2014-18
ForeignBornAfricaNum: Number of persons born in Africa, 2014-18
ForeignBornAsiaNum: Number of persons born in Asia, 2014-18
ForeignBornCaribNum: Number of persons born in the Caribbean, 2014-18
Net_International_Migration_Rate_2010_2019: Net international migration rate, 2010-19
Net_International_Migration_2010_2019: Net international migration, 2010-19
Net_International_Migration_2000_2010: Net international migration, 2000-10
Immigration_Rate_2000_2010: Net international migration rate, 2000-10
NetMigrationRate0010: Net migration rate, 2000-10
NetMigrationRate1019: Net migration rate, 2010-19
NetMigrationNum0010: Net migration, 2000-10
NetMigration1019: Net Migration, 2010-19
NaturalChangeRate1019: Natural population change rate, 2010-19
NaturalChangeRate0010: Natural population change rate, 2000-10
NaturalChangeNum0010: Natural change, 2000-10
NaturalChange1019: Natural population change, 2010-19
TotalPop2010: Population size 4/1/2010 Census
TotalPopEst2010: Population size 7/1/2010
TotalPopEst2011: Population size 7/1/2011
TotalPopEst2012: Population size 7/1/2012
TotalPopEst2013: Population size 7/1/2013
TotalPopEst2014: Population size 7/1/2014
TotalPopEst2015: Population size 7/1/2015
TotalPopEst2016: Population size 7/1/2016
TotalPopEst2017: Population size 7/1/2017
TotalPopEst2018: Population size 7/1/2018
TotalPopEst2019: Population size 7/1/2019
TotalPopACS: Total population, 2014-18 - 5-year average
TotalPopEstBase2010: County Population estimate base 4/1/2010
NonHispanicAsianPopChangeRate0010: Population change rate Non-Hispanic Asian, 2000-10
PopChangeRate1819: Population change rate, 2018-19
PopChangeRate1019: Population change rate, 2010-19
PopChangeRate0010: Population change rate, 2000-10
NonHispanicNativeAmericanPopChangeRate0010: Population change rate Non-Hispanic Native American, 2000-10
HispanicPopChangeRate0010: Population change rate Hispanic, 2000-10
MultipleRacePopChangeRate0010: Population change rate multiple race, 2000-10
NonHispanicWhitePopChangeRate0010: Population change rate Non-Hispanic White, 2000-10
NonHispanicBlackPopChangeRate0010: Population change rate Non-Hispanic African American, 2000-10
MultipleRacePct2010: Percent multiple race, 2010
WhiteNonHispanicPct2010: Percent Non-Hispanic White, 2010
NativeAmericanNonHispanicPct2010: Percent Non-Hispanic Native American, 2010
BlackNonHispanicPct2010: Percent Non-Hispanic African American, 2010
AsianNonHispanicPct2010: Percent Non-Hispanic Asian, 2010
HispanicPct2010: Percent Hispanic, 2010
MultipleRaceNum2010: Population size multiple race, 2010
WhiteNonHispanicNum2010: Population size Non-Hispanic White, 2010
BlackNonHispanicNum2010: Population size Non-Hispanic African American, 2010
NativeAmericanNonHispanicNum2010: Population size Non-Hispanic Native American, 2010
AsianNonHispanicNum2010: Population size Non-Hispanic Asian, 2010
HispanicNum2010: Population size Hispanic, 2010
County classifications: Type of county (rural or urban on a rural-urban continuum scale)
Type_2015_Recreation_NO: Recreation counties, 2015 edition
Type_2015_Farming_NO: Farming-dependent counties, 2015 edition
Type_2015_Mining_NO: Mining-dependent counties, 2015 edition
Type_2015_Government_NO: Federal/State government-dependent counties, 2015 edition
Type_2015_Update: County typology economic types, 2015 edition
Type_2015_Manufacturing_NO: Manufacturing-dependent counties, 2015 edition
Type_2015_Nonspecialized_NO: Nonspecialized counties, 2015 edition
RecreationDependent2000: Nonmetro recreation-dependent, 1997-00
ManufacturingDependent2000: Manufacturing-dependent, 1998-00
FarmDependent2003: Farm-dependent, 1998-00
EconomicDependence2000: Economic dependence, 1998-00
RuralUrbanContinuumCode2003: Rural-urban continuum code, 2003
UrbanInfluenceCode2003: Urban influence code, 2003
RuralUrbanContinuumCode2013: Rural-urban continuum code, 2013
UrbanInfluenceCode2013: Urban influence code, 2013
Noncore2013: Nonmetro noncore, outside Micropolitan and Metropolitan, 2013
Micropolitan2013: Micropolitan, 2013
Nonmetro2013: Nonmetro, 2013
Metro2013: Metro, 2013
Metro_Adjacent2013: Nonmetro, adjacent to metro area, 2013
Noncore2003: Nonmetro noncore, outside Micropolitan and Metropolitan, 2003
Micropolitan2003: Micropolitan, 2003
Metro2003: Metro, 2003
Nonmetro2003: Nonmetro, 2003
NonmetroNotAdj2003: Nonmetro, nonadjacent to metro area, 2003
NonmetroAdj2003: Nonmetro, adjacent to metro area, 2003
Oil_Gas_Change: Change in the value of onshore oil and natural gas production, 2000-11
Gas_Change: Change in the value of onshore natural gas production, 2000-11
Oil_Change: Change in the value of onshore oil production, 2000-11
Hipov: High poverty counties, 2014-18
Perpov_1980_0711: Persistent poverty counties, 2015 edition
PersistentChildPoverty_1980_2011: Persistent child poverty counties, 2015 edition
PersistentChildPoverty2004: Persistent child poverty counties, 2004
PersistentPoverty2000: Persistent poverty counties, 2004
Low_Education_2015_update: Low education counties, 2015 edition
LowEducation2000: Low education, 2000
HiCreativeClass2000: Creative class, 2000
HiAmenity: High natural amenities
RetirementDestination2000: Retirement destination, 1990-00
Low_Employment_2015_update: Low employment counties, 2015 edition
Population_loss_2015_update: Population loss counties, 2015 edition
Retirement_Destination_2015_Update: Retirement destination counties, 2015 edition
The raw data sets are dirty and need transforming before we can do our EDA. It takes time and effort to clean and merge the different data sources, so we provide the final output of the cleaned and merged data. The cleaning procedure is as follows; please read through it to understand what is in the cleaned data. We set eval = data_cleaned in the following cleaning chunks so that they only run if any of data/covid_county.csv, data/covid_rates.csv, or data/covid_intervention.csv does not exist.
# Indicator to check whether the data files exist
data_cleaned <- !(file.exists("data/covid_county.csv") &
file.exists("data/covid_rates.csv") &
                  file.exists("data/covid_intervention.csv"))
We first read in the tables using data.table::fread(), as we did last time.
# COVID case/mortality rate data
covid_rates <- fread("data/us_counties.csv", na.strings = c("NA", "", "."))
nyc <- fread("data/nycdata.csv", na.strings = c("NA", "", "."))
# Socioeconomic data
income <- fread("data/income.csv", na.strings = c("NA", "", "."))
jobs <- fread("data/jobs.csv", na.strings = c("NA", "", "."))
people <- fread("data/people.csv", na.strings = c("NA", "", "."))
county_class <- fread("data/county_classifications.csv", na.strings = c("NA", "", "."))
# Intervention policy data
int_dates <- fread("data/intervention_dates.csv", na.strings = c("NA", "", "."))
The original NYC data contains more information than we need. We extract only the numbers of cases and deaths and format them the same as the covid_rates data.
# NYC county fips matching table
nyc_fips <- data.table(FIPS = c('36005', '36047', '36061', '36081', '36085'),
County = c("BX", "BK", "MN", "QN", "SI"))
# nyc case
nyc_case <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
BX = BX_CASE_COUNT,
BK = BK_CASE_COUNT,
MN = MN_CASE_COUNT,
QN = QN_CASE_COUNT,
SI = SI_CASE_COUNT)]
nyc_case %<>%
pivot_longer(cols = BX:SI,
names_to = "County",
values_to = "cases") %>%
arrange(date) %>%
group_by(County) %>%
mutate(cum_cases = cumsum(cases))
# nyc death
nyc_death <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
BX = BX_DEATH_COUNT,
BK = BK_DEATH_COUNT,
MN = MN_DEATH_COUNT,
QN = QN_DEATH_COUNT,
SI = SI_DEATH_COUNT)]
nyc_death %<>%
pivot_longer(cols = BX:SI,
names_to = "County",
values_to = "deaths") %>%
arrange(date) %>%
group_by(County) %>%
mutate(cum_deaths = cumsum(deaths))
nyc_rates <- merge(nyc_case,
nyc_death,
by = c("date", "County"),
all.x= T)
nyc_rates <- merge(nyc_rates,
nyc_fips,
by = "County")
nyc_rates$State <- "NY"
nyc_rates %<>%
select(date, FIPS, County, State, cum_cases, cum_deaths) %>%
  arrange(FIPS, date)
We only consider cases and deaths in the continental US. Alaska, Hawaii, and Puerto Rico have 02, 15, and 72 respectively as the first two digits of their FIPS codes. We use the %/% operator for integer division to get the first two digits of a FIPS code. All data for counties in NYC are aggregated as County == "New York City" in covid_rates with no FIPS, so we combine the NYC data into covid_rates.
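The integer-division trick can be checked on a couple of known FIPS codes:

```r
# Integer division by 1000 extracts the 2-digit state prefix of a 5-digit FIPS
36061 %/% 1000  # New York County, NY -> 36
2013  %/% 1000  # an Alaska county    -> 2, so it is filtered out
```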
covid_rates <- covid_rates %>%
arrange(fips, date) %>%
filter(!(fips %/% 1000 %in% c(2, 15, 72))) %>%
filter(county != "New York City") %>%
rename(FIPS = "fips",
County = "county",
State = "state",
cum_cases = "cases",
cum_deaths = "deaths")
covid_rates$date <- as.Date(covid_rates$date)
covid_rates <- rbind(covid_rates,
                     nyc_rates)
We set the week of Jan 21, 2020 (the date of the first confirmed COVID case in the US) as the first week (2020-01-19 to 2020-01-25).
covid_rates[, week := (interval("2020-01-19", date) %/% weeks(1)) + 1]
Merge the TotalPopEst2019 variable from the demographic data with covid_rates by FIPS and calculate the cumulative infection/mortality rates for each county (divide the cumulative cases/deaths by the county population). FIPS codes for Kansas City, Missouri, Rhode Island, and some others are missing. We drop them for the moment and output the data up to week 57 as covid_rates.csv.
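The week arithmetic above can be spot-checked on a couple of dates: lubridate's %/% divides an interval by a period, truncating toward zero.

```r
library(lubridate)

# Days since 2020-01-19, integer-divided by 7 days, plus 1
(interval("2020-01-19", as.Date("2020-01-21")) %/% weeks(1)) + 1  # week 1
(interval("2020-01-19", as.Date("2020-01-26")) %/% weeks(1)) + 1  # week 2
```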
covid_rates <- merge(covid_rates[!is.na(FIPS)],
people[,.(FIPS = as.character(FIPS),
TotalPopEst2019)],
by = "FIPS",
all.x = TRUE)
covid_rates %<>%
mutate(cum_infection_rate = cum_cases/TotalPopEst2019,
cum_mortality_rate = cum_deaths/TotalPopEst2019)
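The regression sections above model log_total_death_per100k, which is not created in this chunk. A plausible construction (an assumption on our part, not part of the original cleaning script) would be:

```r
# Assumed definition of the modeling response (not in the original cleaning
# code): cumulative deaths per 100k residents, on the log scale; the +1
# guards against log(0) for counties with no deaths
covid_rates <- covid_rates %>%
  mutate(total_death_per100k = cum_deaths / TotalPopEst2019 * 1e5,
         log_total_death_per100k = log(total_death_per100k + 1))
```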
fwrite(covid_rates %>%
filter(week < 58) %>%
arrange(FIPS, date),
       "data/covid_rates.csv")
int_dates
We convert the columns representing dates in int_dates to R Date types using as.Date(). We need to specify that the origin parameter is "0001-01-01". We output the data as covid_intervention.csv.
int_dates <- int_dates[-1,]
date_cols <- names(int_dates)[-(1:3)]
int_dates[, (date_cols) := lapply(.SD, as.Date, origin = "0001-01-01"),
.SDcols = date_cols]
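The origin argument matters here: as.Date() on a numeric day count needs an explicit origin, and the raw intervention dates count days from year 1.

```r
# Day count 0 is the origin itself
as.Date(0, origin = "0001-01-01")  # "0001-01-01"
as.Date(0, origin = "1970-01-01")  # "1970-01-01", the usual Unix epoch
```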
fwrite(int_dates, "data/covid_intervention.csv")
NA values in the covid_rates data set correspond to a county not having confirmed cases/deaths. We replace the NA values in these columns with zeros.
covid_rates$cum_cases[is.na(covid_rates$cum_cases)] <- 0
covid_rates$cum_deaths[is.na(covid_rates$cum_deaths)] <- 0
Merge the demographic data sets by FIPS and output the result as covid_county.csv.
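Since covid_rates is a data.table, the NA replacement above can also be done in one call, by reference, with setnafill() (numeric columns only); a small self-contained check:

```r
library(data.table)

# setnafill() fills NAs in numeric columns by reference, without copying
dt <- data.table(cum_cases = c(1, NA, 3), cum_deaths = c(NA, 0, 2))
setnafill(dt, fill = 0, cols = c("cum_cases", "cum_deaths"))
dt$cum_cases   # 1 0 3
```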
countydata <-
merge(x = income,
y = merge(
x = people,
y = jobs,
by = c("FIPS", "State", "County")),
by = c("FIPS", "State", "County"),
all = TRUE)
countydata <-
merge(
x = countydata,
y = county_class %>% rename(FIPS = FIPStxt),
by = c("FIPS", "State", "County"),
all = TRUE
)
# Check dimensions
# They are now 3279 x 208
dim(countydata)
fwrite(countydata, "data/covid_county.csv")